Classification and regression trees for linguistic analysis
In this tutorial we’ll go over the basics of how to use classification, regression, and conditional inference trees, more generally referred to as decision tree models. The basic methods have been around for a while (see Breiman 1998), but they are (relatively) new to linguistics. One of the first papers to utilize this technique (and the related technique of random forests) in linguistics was Tagliamonte & Baayen (2012), but the tree & forest method has been gaining ground in recent years as a useful alternative to other methods (e.g. Bernaisch, Gries & Mukherjee 2014; Szmrecsanyi et al. 2016; Gries 2019; Deshors & Gries 2020).
In the first section I’ll explain a bit about how decision trees work, and then we’ll move on to discuss how to use them in R. Feel free to skip ahead to the CARTs in R section below.
Let’s get started!
What are classification and regression trees?
Decision tree models like classification and regression trees (CARTs) are similar in spirit to regression models in that they’re used to predict a response \(Y\) from a set of predictors \(X_1, X_2,..., X_n\). There are several packages in R for computing CARTs and we will briefly look at a couple of them here, namely the rpart and partykit packages.
There are generally two types of decision trees:
- Classification trees are used when you want to predict a categorical response. That is, the dependent variable can be assigned to 2 or more discrete classes, values, or labels.
- Regression trees are used when you want to predict a continuous response. That is, the dependent variable can take any of an infinite range of numerical values.
One of the major differences between tree models and standard regression models is that the latter are ‘global’ models, in the sense that the prediction formula is assumed to hold over the entire data space. This means that in the simple case of a regression formula without any interactions,
\[y = \beta_1x_1 + \beta_2x_2 + \beta_3x_3\]
the effect of predictor \(x_1\) on the response \(y\) is assumed to be the same no matter what the values of predictors \(x_2\) and \(x_3\) may be. This is not so with tree models.
The idea behind tree models is to divide the data space into smaller and smaller non-overlapping partitions, to which simpler models can be applied. Tree models use a top-down approach, in which we begin at the top of the tree where all observations are included in a single region and successively split the data space into new branches (subregions) down the tree. Most tree algorithms are ‘greedy’ in that they consider only the best split for the current region of the data. That is, they don’t care about what splits have come before, or what splits may come after the current node. Splitting continues until some threshold is reached, and how that threshold is defined can have a major impact on the results.
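To make the greedy, top-down idea concrete, here is a minimal sketch of how a single split might be chosen for a continuous response. The function and data here are invented purely for illustration; real implementations like rpart() are far more sophisticated.

```r
# Greedy search for one split: try every cutpoint of x and keep the one
# that minimizes the summed squared error in the two child regions.
# (Toy illustration only -- NOT how rpart() is implemented internally.)
best_split <- function(x, y) {
  candidates <- sort(unique(x))
  sse <- sapply(candidates, function(cut) {
    left  <- y[x <= cut]
    right <- y[x > cut]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  candidates[which.min(sse)]  # only the best split for THIS node is kept
}

set.seed(1)
x <- runif(100)
y <- ifelse(x < 0.4, 0, 3) + rnorm(100, sd = 0.5)
best_split(x, y)  # lands close to the true change point at 0.4
```

In a full tree, this same search would then be applied recursively within each of the two resulting regions, without ever revisiting earlier splits.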
Terminology
A bit of terminology:
- Nodes: Points at which a splitting decision is made. Each node represents a (sub)section of the dataset. The Root node is the node at the top of the tree which represents the entire sample prior to any splitting.
- Branches: Subsections of the entire tree (also called sub-trees)
- Leaves / Terminal nodes: Nodes at the bottom of the tree where no further split is made.
- Parent / Child nodes: Super- and subordinate nodes are referred to as parent and child nodes respectively.
Decision tree terminology
In all decision trees, the leaves of the tree (terminal nodes) give us predictions about our response for the subregion of the dataset that the leaves represent.
- For classification trees, the leaves of the tree (terminal nodes) give us predictions about which class is most likely for that subregion of the data. This is measured simply as the class with the most observations, i.e. the class that makes up the largest proportion of the data.
- For regression trees, the prediction is just the mean value of the response for the observations in a given subregion. So if a data point falls into that region, our prediction for its response will be the average response in that region.
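In other words, the leaf predictions of a regression tree are just group means over the partitions. Here is a quick base-R sketch with made-up data:

```r
set.seed(10)
d <- data.frame(x = runif(50), y = rnorm(50))
# Imagine a tree with a single split at x = 0.5, giving two leaves
d$leaf <- ifelse(d$x < 0.5, "left", "right")
# The prediction for each leaf is simply the mean response in that leaf
tapply(d$y, d$leaf, mean)
```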
For this tutorial, we will focus on methods that use binary splits, though there are methods for producing trees with more than two branches, e.g. with J48() in the RWeka package.
An illustration
Suppose we have a dataset of hypothetical speakers from three different regions, A, B, or C. We measured the proportions of two linguistic features used by these participants, feature F1 and feature F2, and we want to see how well we can predict a participant’s region based on F1 and F2. The values of F1 and F2 are proportions ranging from 0 to 1.
We plot the regions based on F1 and F2 and get this:
We can model this with a decision tree like so.
There are five terminal nodes in the resulting tree, which represent distinct partitions (subsets) of the data. These can be represented in the scatterplot accordingly:
For each partition of the data, the model makes a prediction based on the proportion of A, B, or C participants found in that partition. Simply put, the most frequent region found in the data partition is the winner. To get a prediction for a participant’s region then, we simply find the F1 and F2 values of that participant, locate the partition of the data they fall into, and get the most frequent response in that partition.
Splitting criteria
One of the most important factors affecting the accuracy of tree models is the method that the tree-growing algorithm uses to determine where and when to make a split. The idea is to find the “best” split at any given point, so the question becomes how to determine the “best” split. Decision trees use different algorithms to find the best split, and I’ll briefly mention the most common here. But first, a bit more terminology…
When talking about trees, we often talk about the homogeneity or purity of a given partition of the data with respect to the distribution of the outcome variable. If you consider the plots below, it’s fairly easy to see which plot is the most homogeneous or “pure.” The goal is to end up with partitions at the bottom of the tree that are as homogeneous as possible.
Conversely, we can describe a partition in terms of its impurity, in which case we can think of splitting a node as an attempt to split the data in a manner that minimizes the impurity in the resulting partitions. Splitting is thus driven by the goal of impurity reduction. Again, the aim is to get partitions that are as homogeneous as possible.
For classification trees, there are several splitting methods available. Two of the most common ones are Gini impurity and information gain. The latter is an information-theoretic measure based on the concept of entropy reduction.
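As a rough sketch, both measures can be computed from the class proportions in a node. The helper functions below are my own, written for illustration only; packages like rpart compute these internally.

```r
# Gini impurity and entropy for a node, given class proportions p
gini <- function(p) 1 - sum(p^2)
entropy <- function(p) -sum(p[p > 0] * log2(p[p > 0]))

pure  <- c(1, 0)      # every observation belongs to one class
mixed <- c(0.5, 0.5)  # a 50/50 split between two classes

gini(pure);    gini(mixed)     # 0 vs. 0.5: impurity is maximal at 50/50
entropy(pure); entropy(mixed)  # 0 vs. 1 bit
```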
For regression trees, the standard method relies on finding the split that results in the greatest reduction in variance when comparing the variance of the parent node to the average variance among the child nodes.
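As a minimal sketch (the data and function name here are invented), the variance reduction for a candidate split can be computed like this, with the child variances weighted by their share of the parent's observations:

```r
# Variance of the parent node minus the weighted average variance
# of the two child nodes produced by a candidate split
var_reduction <- function(y, in_left) {
  yl <- y[in_left]
  yr <- y[!in_left]
  n  <- length(y)
  var(y) - (length(yl) / n * var(yl) + length(yr) / n * var(yr))
}

y <- c(1, 1.2, 0.9, 5, 5.1, 4.8)
var_reduction(y, y < 3)  # large: the split separates two tight clusters
```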
See Breiman (1998) for details, or look at the discussion of decision tree metrics on Wikipedia.
CARTs in R
Libraries
Before starting, make sure you have the following packages installed, and load them into your workspace. The code here makes use of the standard {tidyverse} packages and functions. You can find more information via the Tidyverse website, or through the much more extensive R for Data Science book, which is also available online. I also use the {here} package for managing file paths in my projects.
# install and load
library(here)
library(tidyverse)

For the trees and forests, you'll need to install and load the following packages.
# install.packages(c('rpart', 'partykit', 'languageR'))
library(rpart)
library(partykit)
library(languageR)

Datasets
For classifying, we'll use two datasets from studies of syntactic alternations in English. We'll load the datasets directly from my GitHub repository.
# English genitives ('s vs. of)
gens <- read_delim("https://raw.githubusercontent.com/jasongraf1/Bham_Stats_Summer_School_2022/main/data/brown_genitives.txt",
delim = "\t", trim_ws = TRUE, col_types = cols())
# English relativizers (that vs. which vs. ZERO)
rels <- read_delim("https://raw.githubusercontent.com/jasongraf1/Bham_Stats_Summer_School_2022/main/data/brown_relativizers.txt",
delim = "\t", trim_ws = TRUE, col_types = cols())

Genitive alternation
This dataset contains data from 5 sections of the Brown corpus (Wikipedia) used in Grafmiller (2014), as well as a complementary dataset from the Frown corpus.
- the president’s assertion [s-genitive]
- the contour of her face [of-genitive]
The English genitive alternation is known to be correlated with a number of features, including…
- the animacy of the possessor
- the length of the possessor and possessum
- the presence of a sibilant at the end of the possessor (Bush’s prestige)
- semantic relation between possessor and possessum (kinship, ownership, body-part, etc.)
- the ‘thematicity’ (text frequency) of the possessor
- the genre of the text (Newspaper, academic, fiction, etc.)
The goal of this study was to investigate the factors that co-determine (or at least correlate with) the choice of genitive variant.
# inspect the data
head(gens)

English Relativizers
This is the dataset of English relativizers originally compiled by Hinrichs et al. (2015) and also used in Grafmiller et al. (2016).
- a doctrine that nobody either does or need hold
- mistakes which others manage to avoid
- the boundary conditions ___ we impose
The choice among relative pronouns in English is known to be correlated with a number of features, including…
- the length of the relative clause
- the length of the antecedent NP (both this and the above are related to the complexity of the RC context)
- the part of speech of the antecedent
- the number of the antecedent
- the formality of the text
- prior use of a particular pronoun (‘structural persistence/priming’)
- the predictability of an upcoming RC, given the preceding material
As with the genitives, the goal of these studies was to investigate the factors that co-determine/correlate with the choice of relativizer, particularly that vs. which vs. ZERO.
# inspect the data
head(rels)

English lexical decision times
For illustrating regression trees we’ll use a dataset that comes from the languageR package (Baayen 2008) and gives mean visual lexical decision latencies and word naming latencies for 2284 monomorphemic English nouns and verbs, averaged for old and young subjects, with various other predictor variables.
english <- languageR::english %>%
as_tibble()
head(english)

You can find more information about this dataset by consulting the documentation with ?languageR::english.
Classification trees
Classification trees are similar to logistic regression in that they are used to predict the probability of a set of two or more possible responses or outcomes. This is perhaps the most intuitive use of tree models.
A simple tree with rpart()
Let’s start by fitting a simple classification tree model and plot it. We’ll consider the effect of possessor animacy (as a binary variable) and final sibilant on the choice of genitive variant.
First we’ll define the formula for quick calling. Here we are trying to predict the Type of genitive construction based on whether the possessor is animate or not (Possessor.Animacy2), and whether the possessor ends in a sibilant sound (Final.Sibilant).
gen_fmla1 <- Type ~ Possessor.Animacy2 + Final.Sibilant

We'll start with the rpart() function for creating trees, found in the rpart package. We'll fit the model and plot it. See ?plot.rpart for help.
gen_rpart1 <- rpart(gen_fmla1, data = gens, method = "class")
plot(gen_rpart1, uniform = T, branch = 0.8, margin = 0.1)
text(gen_rpart1, all = T, use.n = TRUE)
title("Genitives rpart tree 1: Gini")

The default splitting criterion is "gini," but we can try a different one. With binary outcomes, the two methods almost always yield the same results.
gen_rpart1b <- rpart(gen_fmla1, data = gens, method = "class", parms = list(split = "information"))
plot(gen_rpart1b, uniform = T, branch = 0.8, margin = 0.1)
text(gen_rpart1b, all = T, use.n = TRUE)
title("Genitives rpart tree 1: Information")

Making nice(r) plots
Unfortunately the basic plotting functions for rpart are rather ugly and unhelpful, but there are ways to make the output nicer. One way is to convert the tree to a party object with as.party() from the partykit package (more on this below).
plot(as.party(gen_rpart1))

This output is much clearer.
We see that initially, the best split is between animate and inanimate possessors. Then, we look at the corresponding subsets and see that the presence of a final sibilant also has an effect. However, it seems that the effect is limited to the subset of the data in which the possessor is animate, as indicated by the presence of a split on the right, but not on the left.
The rattle package makes nice graphs as well with the fancyRpartPlot() function.
rattle::fancyRpartPlot(gen_rpart1)

More complex trees
Next we'll define a formula for genitive choice using more predictors. These are mostly categorical predictors, so we hope the model will stay relatively simple.
gen_fmla2 <- Type ~ Possessor.Animacy2 + Final.Sibilant + Possessor.Length + SemanticRelation +
Possessor.Expression.Type + Genre + Corpus + Possessum.Length + PossessorThematicity

Now we fit a tree using this formula.
gen_rpart2 <- rpart(gen_fmla2, data = gens)
plot(gen_rpart2, uniform = T, branch = 0.8, margin = 0.1)
text(gen_rpart2, all = T, use.n = TRUE)
title("Genitives rpart tree 2")

The partykit package also has a useful function as.party() for making nicer-looking trees.
# create a party plot with the pipe operator
as.party(gen_rpart2) %>%
plot(gp = gpar(cex = 0.7))

This is a sign that our model may not be the best model for the purposes of learning something about the larger population our data is sampled from. The problem is that the tree-building algorithm will look for any and all possible splits in the data, regardless of whether those divisions are likely to be replicated if we were to take different samples of genitive constructions. This is the problem of overfitting: our tree model is too finely tuned to our specific dataset, thus it does not likely represent a good model of genitive variation in general.
Regression trees
With regression trees, the value shown on the leaf nodes is the expected value of the response for the given partition of the data. We’ll look at the effects of a number of predictors on lexical decision latencies in a word recognition task. We’ll consider the age of the subject (AgeSubject = ‘old’ or ‘young’), written frequency of the word, and the syntactic category (‘N’ or ‘V’) of the word.
tree_fmla <- RTlexdec ~ AgeSubject + WordCategory + WrittenFrequency

lexdec_rpart1 <- rpart(tree_fmla, data = english)
plot(lexdec_rpart1, uniform = T, branch = 0.8, margin = 0.1)
text(lexdec_rpart1, all = T, use.n = TRUE)
title("Lexical decision times")

as.party(lexdec_rpart1) %>%
  plot()

rpart::plotcp(lexdec_rpart1)

So if we take only those cases where the subject is old (the branch on the right) and the WrittenFrequency is less than 3.961 (Node 11), and then calculate the mean value of RTlexdec for that subset of the data, we get 6.758. This is the predicted response time for old subjects responding to words with a written frequency less than 3.961.
subset(english, AgeSubject == "old" & WrittenFrequency < 3.961) %>%
.$RTlexdec %>%
mean()

[1] 6.758294
So what we can conclude from this tree is that age and written frequency seem to have the biggest effect on decision times, while word category has relatively little impact.
We can examine the relationship between these variables with plots like those below, where it’s fairly clear that age has a big effect.
ggplot(english, aes(WrittenFrequency, RTlexdec)) +
  geom_point(aes(col = WordCategory), alpha = 0.5) +
  geom_smooth(method = "loess", color = "black") +
  geom_smooth(method = "lm", color = "red") +
  ggtitle("Lexical decision time by frequency and category") +
  scale_color_discrete(name = "") +
  theme(legend.position = "bottom")

ggplot(english, aes(WrittenFrequency, RTlexdec)) +
  geom_point(aes(col = AgeSubject), alpha = 0.5) +
  geom_smooth(method = "loess", color = "black") +
  geom_smooth(method = "lm", color = "red") +
  ggtitle("Lexical decision time by frequency and age") +
  scale_color_discrete(name = "") +
  theme(legend.position = "bottom")

Conditional inference trees
Conditional inference trees (‘ctrees’) are another type of decision tree model. Ctrees are similar to traditional CARTs, except that the method is designed explicitly to avoid the known biases of such methods, which tend to favor variables that have many possible splits or many missing values (Hothorn, Hornik & Zeileis 2006).
In more technical terms, conditional inference trees derive splits using a permutation significance test procedure “in which the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points” (from Wikipedia).
For more detailed discussion of the how conditional inference trees work, see Hothorn, Hornik & Zeileis (2006).
For this tutorial, we’ll use the {partykit} package, which is an extension of the earlier {party} package (do NOT load both at the same time). It’s important to note that both packages contain functions with the same names, most importantly ctree() and cforest(). If you load both packages into your workspace, you will see warnings about certain functions being masked. This means that the function from the first package loaded will no longer be called by default. For this reason, it’s usually a good idea to load only one of these packages at a given time. Alternatively, you can call functions from specific packages explicitly with the syntax package::function().
Before using ctree() we need to make sure the columns of the dataframe are in the necessary format. ctree() doesn’t work with character vectors.
str(gens[, all.vars(gen_fmla1)])

tibble [5,098 x 3] (S3: tbl_df/tbl/data.frame)
 $ Type              : chr [1:5098] "of" "of" "of" "of" ...
 $ Possessor.Animacy2: chr [1:5098] "animate" "animate" "animate" "animate" ...
 $ Final.Sibilant    : chr [1:5098] "N" "Y" "N" "N" ...
These columns are all character vectors, so we'll need to convert them (and the other variables in gen_fmla2) to factors.
all.vars(gen_fmla2)

 [1] "Type"                      "Possessor.Animacy2"
 [3] "Final.Sibilant"            "Possessor.Length"
 [5] "SemanticRelation"          "Possessor.Expression.Type"
 [7] "Genre"                     "Corpus"
 [9] "Possessum.Length"          "PossessorThematicity"
gens <- gens %>%
mutate(across(all.vars(gen_fmla2)[c(1:3, 5:8)], as.factor))

Simple conditional inference tree
Fit a ctree model and plot it. We’ll consider again the effect of possessor animacy (as a binary variable) and final sibilant on the choice of genitive variant.
gen_ctree1 <- ctree(gen_fmla1, data = gens)
plot(gen_ctree1)

We see from Node 1 that initially, the best split is between animate and inanimate possessors. Then, we look at the corresponding subsets and see that the presence of a final sibilant also has an effect, in both the animate and inanimate subsets.
plot(gen_ctree1, inner_panel = partykit::node_barplot(gen_ctree1))

We can generate predictions from the tree with the predict() function.
ctree_predict <- predict(gen_ctree1)
# proportion of observations correctly predicted by tree:
sum(gens$Type == ctree_predict)/nrow(gens)

[1] 0.7826599
This is a considerable improvement over the baseline, i.e. the percentage of observations we’d correctly predict if we simply guessed the more frequent response (of-genitive) every time.
max(table(gens$Type)/nrow(gens))

[1] 0.6086701
Predict multiple responses
Let’s try to predict English relativizers with a simplified model, looking at only the object RCs (the book that/which/∅ you wrote).
rels <- rels %>%
mutate(across(c(2:4, 6:8, 12, 15, 42), as.factor))

objrels <- rels %>%
filter(relFct == "Obj") %>%
droplevels()
rc_ctree1 <- ctree(rel ~ variety + time + antPOS2, data = objrels)
plot(rc_ctree1)

You’ll probably find that the default plot is difficult to read. We’ll go over how to deal with this shortly.
Pruning and tuning
Tree models are also prone to overfitting the data. Consider the tree below, based on a much more complex model of genitive choice. This tree is far too complicated to interpret in any useful way.
gen_ctree2 <- ctree(gen_fmla2, data = gens)
plot(gen_ctree2, gp = gpar(cex = 0.7))
Adjusting for overfitting can be done post-hoc by “pruning” the tree, or it can be done prior to fitting the model by tuning the control settings of the function. Pruning is important for CARTs fit with tree and rpart, but it is generally not something that is done with ctrees (indeed, the technique was designed partly to make pruning unnecessary). What we should do instead, is decide how to control the growth of the tree beforehand, in an unbiased way. There are a number of ways to do this.
Adjusting p-value
One way is to adjust the level of statistical significance that the ctree requires a test to meet before making a split. A split is only made when the global null hypothesis of independence between the response and the predictors can be rejected at a chosen p-value (by default, the standard α = 0.05). You can adjust this under the ctree_control() options.
Let’s set it to require a p-value of 0.001 or below:
gen_ctree2b <- ctree(gen_fmla2, data = gens, control = ctree_control(mincriterion = 0.999)) # now p < .001
# Note the value for 'mincriterion' is 1 minus the desired level: 1 - .001 =
# .999
plot(gen_ctree2b)

The tree is simpler now, but still pretty unmanageable.
Limiting branching depth
Another way to simplify the tree is to limit the maximum depth of the branching with the maxdepth argument. The default is maxdepth = Inf, but you can tell it to stop splitting after a certain number of levels is reached.
gen_ctree2c <- ctree(gen_fmla2, data = gens, control = ctree_control(mincriterion = 0.999,
maxdepth = 3))
plot(gen_ctree2c)

This is better, but we may be losing some predictive power. The tree is likely to make less accurate predictions, but it is at least interpretable. Other options can be adjusted as needed/desired. These include:

- minsplit: the number of data points necessary to consider splitting. If there are fewer data points than this value, the model will not try to find any split.
- minbucket: the minimum number of data points that the resulting subsets must contain. If a given split results in one or more partitions with fewer data points than this value, the model will ignore that split.
gen_ctree2d <- ctree(gen_fmla2, data = gens, control = ctree_control(mincriterion = 0.999,
minsplit = 1000L))
plot(gen_ctree2d)

gen_ctree2e <- ctree(gen_fmla2, data = gens, control = ctree_control(mincriterion = 0.999,
    minsplit = 1000L, minbucket = 200L))
plot(gen_ctree2e)

See the documentation for ctree and ctree_control for more.
Graphical parameters
Large trees can be unwieldy to plot, and the partykit package offers ways to adjust the graphical settings of your trees to help with presentation.
help("party-plot")

Basic global parameters are specified with a gpar() object.
?gpar

You can see the current settings like so.
str(get.gpar())

List of 14
$ fill : chr "white"
$ col : chr "black"
$ lty : chr "solid"
$ lwd : num 1
$ cex : num 1
$ fontsize : num 12
$ lineheight: num 1.2
$ font : int 1
$ fontfamily: chr ""
$ alpha : num 1
$ lineend : chr "round"
$ linejoin : chr "round"
$ linemitre : num 10
$ lex : num 1
- attr(*, "class")= chr "gpar"
Fonts
Font size, face (bold, italic), and family (serif, Times, etc.) can be changed with the following parameters.
- cex: Multiplier applied to fontsize
- fontsize: The size of text (in points)
- fontface: Can be an integer or a string. If an integer, it follows the R base graphics standard: 1 = plain, 2 = bold, 3 = italic, 4 = bold italic. If a string, valid values are: "plain", "bold", "italic", "oblique", and "bold.italic".
- fontfamily: Changes to the fontfamily may be ignored by some devices. The fontfamily may be used to specify one of the Hershey Font families (e.g., HersheySerif), and this specification will be honoured on all devices.
# (cex is a multiplier)
plot(gen_ctree1, gp = gpar(cex = 0.8, fontfamily = "Times New Roman"))

plot(rc_ctree1, gp = gpar(cex = 0.8, fontface = "italic"))

Line type & width
Line type (solid, dashed, etc.) and width can be adjusted with the lty and lwd or lex arguments respectively (lex is a multiplier similar to cex).
Line type can be specified using either text (“blank,” “solid,” “dashed,” “dotted,” “dotdash,” “longdash,” “twodash”) or number (0, 1, 2, 3, 4, 5, 6). Note that lty = “solid” is identical to lty = 1.
plot(gen_ctree1, gp = gpar(cex = 0.8, lty = 2))  # dashed lines

plot(gen_ctree1, gp = gpar(cex = 0.8, lwd = 2))  # thicker lines

Colors
plot(gen_ctree1, gp = gpar(cex = 0.8, col = "blue"))

Changing panels
The panels and edges themselves can be formatted as well (as we saw above)
?panelfunctions

plot(gen_ctree1,
inner_panel = node_barplot, # create tree with inner panels as parplots
ip_args = list(id = T), # remove IDs from inner panels
tp_args = list(fill = c("palegreen4", "palegreen1"))
)

plot(rc_ctree1,
gp = gpar(cex = .8),
ip_args = list(id = F, pval = F), # remove IDs and pvals from inner panels
tp_args = list(id = F)
)

# color bars
plot(rc_ctree1, tp_args = list(fill = heat.colors(3)))

A word of caution
Decision tree models seem quite useful; however, it is easy to be led astray by them. For example, it’s often assumed that trees are good at representing interaction effects, but there are cases in which this assumption cannot be maintained. This is sometimes referred to as the XOR problem,
which describes a situation where two variables show no main effect but a perfect interaction. In this case, because of the lack of a marginally detectable main effect, none of the variables may be selected in the first split of a classification tree, and the interaction may never be discovered. (Strobl, Malley & Tutz 2009:341)
A recent simulation study by Gries (2019) illustrates this problem quite nicely. When we don’t know the actual relationship between our predictors and the outcome, we should be extra careful about making claims regarding (the absence of) interactions in the data.
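To see the problem in miniature, consider a tiny simulated XOR dataset (the variable names here are invented). Neither predictor shows any marginal association with the outcome, so a first greedy split on either one reduces impurity not at all, even though the two predictors jointly determine the outcome perfectly:

```r
# XOR: y depends only on the interaction of x1 and x2
d <- expand.grid(x1 = c(0, 1), x2 = c(0, 1))
d <- d[rep(1:4, each = 25), ]  # 25 observations per cell
d$y <- factor(xor(d$x1, d$x2))

# Each predictor alone is useless: within x1 = 0 (or 1), y is still 50/50
table(d$x1, d$y)

# But together they predict y perfectly
table(interaction(d$x1, d$x2), d$y)
```

A greedy tree that evaluates x1 and x2 one at a time sees no worthwhile first split here, so the interaction may never be discovered.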
A second word of caution about trees, which to my knowledge has not been raised in the literature, involves the use of splits to identify inflection points (non-linearities) in continuous predictors. For instance, Tagliamonte, D’Arcy & Louro (2016) use ctrees to identify what they refer to as “shock points” in the developmental timeline of the English quotative system, specifically the rapid global increase in the use of quotative like. They illustrate this trend with figures such as the one below:
Tagliamonte et al. (2016:833)
They argue that trees like this reveal the points in time where the trajectory of quotative like significantly changed from previous years, where its use began to accelerate (or decelerate), and these points in time naturally open themselves up to interpretation. Presumably there is some reason why this variable shows substantial changes at these particular times. But we might also wonder: given that ctree models are designed to make just these kinds of binary partitions in the data, shouldn’t we expect them to find such split points even when the underlying change is smooth?
To put it another way, how would a decision tree, constrained to make binary partitions in the data, represent a truly linear effect of a predictor x on some outcome y, such as illustrated below?
There is clearly no obvious curvature in the data here, so what is likely to happen if we fit a tree model to such a dataset? How many splits do we get? How are they distributed? Where might the tree split, and why?1
We can test this with a small simulation study. Doing so reveals how branching in the trees can suggest non-linearities in a misleading way. We’ll simulate a dataset predicting choice of quotative like vs. the standard say based on the date of birth (DOB) of a speaker.
set.seed(43214)
# simulate a dataset
DOB <- ceiling(rnorm(200, 1960, 15))
y <- scale(DOB) * 1.5 + rnorm(200, 0, 2) # add some noise
# y is a set of outcome probabilities on the log odds scale
# We'll use log odds because they can range from -Inf to Inf
df <- data.frame(DOB = DOB, y = y) %>%
mutate(
prob = gtools::inv.logit(y), # convert log odds to probabilities
resp = factor(if_else(prob > .5, "like", "say")),
bin = as.numeric(resp) - 1
)
summary(df)

      DOB             y                  prob            resp
 Min.   :1920   Min.   :-6.29318   Min.   :0.001845   like:101
 1st Qu.:1950   1st Qu.:-1.95962   1st Qu.:0.123523   say : 99
 Median :1960   Median : 0.06341   Median :0.515842
 Mean   :1961   Mean   :-0.07830   Mean   :0.488721
 3rd Qu.:1971   3rd Qu.: 1.64026   3rd Qu.:0.837569
 Max.   :2006   Max.   : 5.69999   Max.   :0.996665
      bin
 Min.   :0.000
 1st Qu.:0.000
 Median :0.000
 Mean   :0.495
 3rd Qu.:1.000
 Max.   :1.000
From a conditional density plot, it’s clear that the effect is about as linear as we could want. There is very little curvature in the line dividing the two halves of the plot.
par(mar = c(5, 4, 4, 2)) # increase top margin
cdplot(resp ~ DOB, df, main = "CD plot of simulated linear effect of DOB\non use of quotative 'like' over 'say'")

But if we fit tree models to the data, they nonetheless suggest some “shock points,” which we might be tempted to interpret as meaningful in some way.
ctree(resp ~ DOB, df) %>%
plot(main = "Ctree simulated linear effect of DOB\non use of quotative 'like' over 'say'")

Other tree methods fare much the same (try these for yourself).
rpart1 <- rpart(resp ~ DOB, df)
plot(rpart1)
text(rpart1)

library(tree)  # for the tree() function
tree1 <- tree(resp ~ DOB, df)
plot(tree1)
text(tree1)

The same problem arises with regression (when the outcome is continuous). Consider the first plot above. A quick regression model shows this is a significant effect.
summary(lm(y ~ DOB, df))
Call:
lm(formula = y ~ DOB, data = df)
Residuals:
Min 1Q Median 3Q Max
-5.1049 -1.3450 0.0143 1.2515 6.1491
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -204.139961 17.730391 -11.51 <2e-16 ***
DOB 0.104078 0.009043 11.51 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.951 on 198 degrees of freedom
Multiple R-squared: 0.4009, Adjusted R-squared: 0.3978
F-statistic: 132.5 on 1 and 198 DF, p-value: < 2.2e-16
But again, a tree model suggests spurious “shock” points.
ctree(y ~ DOB, df) %>%
plot(main = "Ctree of simulated linear effect of DOB on continuous outcome")

The lesson here is that conditional inference trees are capable of capturing linear effects, to a certain degree, but they are constrained in ways that other methods are not. We should therefore be cautious about reading too much into the individual split points of continuous predictors derived solely from a tree. It’s always a good idea to verify such patterns using other techniques, such as conditional density plots for categorical outcomes or simple scatterplots for continuous outcomes. If you don’t see much of a pattern in these plots, you probably should not make much of the specific split points in your tree models.
1Note: My aim here is not to criticize the work of Tagliamonte et al. (2016)—indeed the effects they find appear to be quite robust and genuinely non-linear, which we’d expect of the usual s-curve patterns observed in language change. In fact, if you go back to their data and plot them with a CD plot, it is clear that the “shock” points they observe in their trees are likely very real.
Summary
Advantages of conditional inference trees:
- Non-parametric. Tree models don’t make any assumptions about the distribution of the data, which means they require very little data preparation (unlike parametric methods such as regression).
- Computationally quick and simple. They can work on very large datasets in reasonable amounts of time.
- Easy to understand. In simple cases, trees are relatively easy to understand and interpret even for people without much background in statistics. [But see below…]
Disadvantages of conditional inference trees:
- Overfitting. Tree models tend to overfit the data, and require pruning or tuning.
- Sensitive to particularities of your data. This problem is similar in spirit to overfitting. Slight changes can result in different trees, which can give different predictions. This makes the results of the tree less likely to generalize to new data.
- Prone to misinterpretation. Trees with many interacting predictors do not always accurately represent the true patterns in the data (Gries 2019). Trees can suggest spurious non-linearities and inflate effects of individual predictors.
- Overly complex trees. With large datasets and many predictors, we often get very large trees that are difficult to interpret.
- Low accuracy. Trees are generally less accurate than other methods, e.g. regression.
- Not ideal for continuous data. Tree models tend to lose too much information when trying to model continuous outcomes. In these cases, regression models are usually preferred.
For further reading see chapter 14 in Levshina (2015) and chapter 9.2 in Hastie, Tibshirani & Friedman (2009).